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Abstract 



We propose a probabilistic framework for interpreting and developing hard thresholding sparse signal recon- 
struction methods and present several new algorithms based on this framework. The measurements follow an 
underdetermined linear model, where the regression-coefficient vector is the sum of an unknown deterministic 
sparse signal component and a zero-mean white Gaussian component with an unknown variance. We first derive an 
expectation-conditional maximization either (ECME) iteration that guarantees convergence to a local maximum of the 
likelihood function of the unknown parameters for a given signal sparsity level. To analyze the reconstruction accuracy, 
we introduce the minimum sparse subspace quotient (SSQ), a more flexible measure of the sampling operator than the 
well-established restricted isometry property (RIP). We prove that, if the minimum SSQ is sufficiently large, ECME 
achieves perfect or near-optimal recovery of sparse or approximately sparse signals, respectively. We also propose a 
double overrelaxation (DORE) thresholding scheme for accelerating the ECME iteration. If the signal sparsity level is 
unknown, we introduce an unconstrained sparsity selection ( USS) criterion for its selection and show that, under certain 
conditions, applying this criterion is equivalent to finding the sparsest solution of the underlying underdetermined 
linear system. Finally, we present our automatic double overrelaxation (ADORE) thresholding method that utilizes 
the USS criterion to select the signal sparsity level. We apply the proposed schemes to reconstruct sparse and 
approximately sparse signals from tomographic projections and compressive samples. 



Expectation-conditional maximization either (ECME) algorithm, iterative hard thresholding, sparse signal recon- 
struction, sparse subspace quotient, unconstrained sparsity selection, overrelaxation. 



Sparsity is an important concept in modem signal processing. Sparse signal processing methods have 
been developed and applied to biomagnetic and magnetic resonance imaging, spectral estimation, wireless 
sensing, and compressive sampling, see Jil-llHl and references therein. For noiseless measurements, the 
major sparse signal reconstruction task is finding the sparsest solution of an underdetermined linear system 
y = H s (see e.g. eq. (2)]): 



where y is an x 1 measurement vector, H is a known N x m full-rank sensing matrix with N < m, s 
is an m X 1 unknown signal vector, and ||s||^(, counts the number of nonzero elements in the signal vector 
s. The (Po) problem requires combinatorial search and is known to be NP-hard [|9J. 

A number of tractable approaches have been proposed to find sparse solutions to underdetermined 
systems. They can be roughly divided into three groups: convex relaxation, greedy pursuit, and probabilistic 



Index Terms 



I. Introduction 




min subject to y = H s 



(1.1) 
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methods. Convex methods replace the £o-norm penalty with the £i-norm penalty and solve the resulting 
convex optimization problem. Basis pursuit (BP) directly substitutes Iq with ii in the (Pq) problem, see 
[fTOl . To combat measurement noise and accommodate for approximately sparse signals, several methods 
with various optimization objectives have been suggested, e.g. basis pursuit denoising (BPDN) [[TTll . IfTOl 
and Dantzig selector ifTIIl . The gradient projection for sparse reconstruction (GPSR) algorithm in lfT3l 
solves the unconstrained version of the BPDN problem in a computationally efficient manner. Greedy 
pursuit methods approximate the (Pq) solution in an iterative manner by making locally optimal choices. 
Orthogonal matching pursuit (OMP) [fT4l . [fTSl . [fT6l . compressive sampling matching pursuit (CoSaMP) 
[[TtI . and iterative thresholding schemes [[T8l -[[2T | belong to this category. Probabilistic methods utilize 
full probabilistic models and statistical inference tools to solve the sparse signal reconstruction problem. 
Examples of the methods in this group are: sparse Bayesian learning (SBL) [i22l . Bayesian compressive 
sensing (BCS) ll23l and expansion-compression variance-component based method (ExCoV) [[24|. Most 
existing sparse signal reconstruction schemes require tuning ll25l . where the reconstruction performance 
depends crucially on the choice of the tuning parameters. 

Iterative hard thresholding (IHT) and normalized iterative hard thresholding (NIHT) algorithms in [fT9l - 
[[2T| (see also [18]) have attracted significant attention due to their low computation and memory require- 
ments and theoretical and empirical evidence of good reconstruction performance. The IHT and NIHT 
methods require only matrix-vector multiplications and do not involve matrix-matrix products, matrix 
inversions, or solving linear systems of equations. The memory needed to implement IHT and NIHT is just 
0{Nm), and can be further reduced to 0{m) if the sensing operator H is realized in a function-handle 
form. However, the IHT and NIHT methods 

• converge slowly, demanding a fairly large number of iterations, 

• require the knowledge of the signal sparsity level, which is a tuning parameter, and 

• are sensitive to scaling of the sensing matrix (IHT) or require elaborate adjustments in each iteration 
to compensate for the scaling problem (NIHT). 

IHT and NIHT guarantee good recovery of the underlying sparse signal if the sensing matrix satisfies 
the restricted isometry property (RIP) and a modified non- symmetric RIP; see fl^ and |I2IT|, respectively. 
The restricted isometry property was introduced in [|7| to measure how well sparse vectors preserve their 
magnitudes after being transformed by the sensing matrix H. To preserve this magnitude for a sparsity 
level r, any r columns of H must be approximately orthonormal, which corresponds to H having a small 
restricted isometry constant (RIC). We refer to this requirement as the RIP condition. (See Section ITVl for 
the definition of the RIC for sparsity level r, statement of the corresponding RIP condition, and further 
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discussion.) Besides being used to analyze the IHT and NIHT schemes, the RIP condition is a common 
ingredient of reconstruction performance analyses of many sparse reconstruction methods, e.g. convex 
methods Q, El, lEl and CoSaMP [UTI. However, 

• the RIP condition is quite restrictive: a simple linear transform or even a scaling of by a constant 
can easily break the equilibrium required by RIP. 

The contribution of this paper is four-fold. 

1. Probabilistic model. We propose a probabilistic framework for generalizing iterative hard thresholding 
(IHT) algorithms and interpreting them as expectation-conditional maximization either (ECME) iterations, 
see also [|26l . If the rows of the sensing matrix H are orthonormal, the signal update of the ECME iteration 
is equivalent to one IHT step. Note that IHT is a greedy pursuit scheme whereas ECME is a probabilistic 
scheme; hence, our framework blurs the boundary between the two categories. 

2. Analysis. We prove that our ECME iteration monotonically converges to a fixed point corresponding 
to a local maximum of the marginal likelihood function under our probabilistic model. The conditions that 
we use in this convergence analysis are invariant to invertible linear transforms of either the rows or the 
columns of H, which indicates that the convergence of our ECME iteration is robust to linear transforms 
of H. Such a convergence robustness to linear transforms and scaling of H is in contrast to the IHT 
convergence analysis in [[T9l Theorem 4] that requires the spectral norm of H to be strictly less than one. 

We also provide perfect and near-optimal guarantees for the recovery of sparse and approximately sparse 
signals, respectively. Our signal recovery analysis does not rely on the common assumption that H has 
a sufficiently small RIC; rather, we introduce new measures of H useful for reconstruction analysis: the 
r-sparse subspace quotient (r-SSQ) and minimum r-SSQ. The minimum r-SSQ measures how well sparse 
vectors with sparsity level r preserve their magnitudes after being projected onto the row space of H, see 
Section HYI Unlike the RIC, the minimum r-SSQ is invariant to invertible linear transforms of the rows of 
H. We prove that, if the minimum 2r-SSQ of the sensing matrix is larger than 0.5, our ECME algorithm 
for sparsity level r 

• perfectly recovers the true r-sparse signal from noiseless measurements and 

• estimates the best r-term approximation of an arbitrary non-sparse signal from noisy measurements 
within a bounded error. 

Due to the row transform invariance of the minimum r-SSQ, our reconstruction analysis allows for sensing 
matrices that violate the RIP condition: the columns of the sensing matrices can have arbitrary magnitudes 
and be highly correlated. Therefore, our results widen the scope of sensing matrices that allow perfect or 
satisfactory sparse reconstruction performance via tractable algorithms. 
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3. Convergence acceleration. We develop a double overrelaxation (DORE) thresholding method that 
interleaves two overrelaxation steps with ECME steps, see also [27] . DORE significantly accelerates the 
convergence of the ECME algorithm (and therefore of the IHT method as well, which is its special case). The 
line searches in the overrelaxation steps have closed-form solutions, making these steps computationally 
efficient. The theoretical convergence and reconstruction properties of ECME in 2 (above) apply to the 
DORE method as well. 

4. Signal sparsity level selection. Finally, we propose an automatic double overrelaxation (ADORE) 
thresholding method that does not require the knowledge of the signal sparsity level. To automatically 
select the sparsity level (i.e. estimate it from the data), we introduce an unconstrained sparsity selection 
(USS) model selection criterion. We prove that, under certain mild conditions, the unconstrained criterion 
USS is equivalent to the constrained (Pq) problem (11.11 ). ADORE combines the USS criterion and DORE 
iteration and applies a golden- section search to maximize the USS objective function. 

In Section [III we introduce our two-stage hierarchical probabilistic model and the ECME thresholding 
algorithm (Section [11- Al) . Our convergence and near-optimal reconstruction analyses of the ECME iteration 
are presented in Sections III] and |Wl respectively. In Section |Vl we describe the DORE thresholding method 
for accelerating the convergence of the ECME iteration. In Section |Vll we introduce the USS criterion and 
our ADORE thresholding scheme (Section IVI-AI) . In Section IVII[ we compare the performances of the 
proposed and existing large-scale sparse reconstruction methods via numerical experiments. Concluding 
remarks are given in Section IVIIII 

A. Notation and Terminology 

We introduce the notation used in this paper: 

• Mi^y ; /X, E) denotes the multivariate probability density function (pdf) of a real-valued Gaussian 
random vector y with mean vector fi and covariance matrix S; 

• I ■ |, II ■ ll^p, det(-), denote the absolute value, £p norm, determinant, and transpose, respectively; 

• the smallest integer larger than or equal to a real number x is \x~\ ; 

• /„, 0„xi, and Onxm are the identity matrix of size n, the n x 1 vector of zeros, and the n x m matrix 
of zeros, respectively; 

• Amin(X) and Aniax(-^) are the minimum and maximum eigenvalues of a real-valued symmetric square 
matrix X; 

• spark(i7) is the smallest number of linearly dependent columns of a matrix H [[8l; 

• Ha denotes the restriction of the matrix H to the index set A, e.g. if A = {1,2,5}, then = 
[hi h2 h^], where hi is the ith column of H; 
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• sa is the restriction of a column vector s to the index set A, e.g. if A = {1, 2, 5}, then sa = [si, S2, s^]^, 
where Sj is the ith element of s; 

• dim{A) denotes the size of a set A; 

• supp(a;) returns the support set of a vector x, i.e. the index set corresponding to the nonzero elements 
of X, e.g. supp([0, 1, -5, 0, 3, 0]^) = {2, 3, 5}; 

• the thresholding operator Tr{x) keeps the r largest-magnitude elements of a vector x intact and sets 
the rest to zero, e.g. r2([0, 1, -5, 0, 3, 0]^) = [0, 0, -5, 0, 3, 0]^. 

We refer to an iV x m sensing matrix H as proper if it has full rank and 

N <m (1.2) 

which implies that the rank of H is equal to N. Throughout this paper, we assume that sensing matrices 
H are proper, which is satisfied in almost all practical sparse signal reconstruction scenarios. 

II. Probabilistic Model and the ECME Algorithm 
We model di N xl real-valued measurement vector y as 

y = Hz (2.1a) 

where if is an A^^ x m real-valued proper sensing matrix, z is an m x 1 multivariate Gaussian vector with 
pdf 

p,ig{z\0)^J\f{z; s,a^Im) (2.1b) 

s = [si,S2, ■ ■ ■ , Sm]'^ is an unknown m x 1 real-valued sparse signal vector containing at most r nonzero 
elements (r < m), and o"^ is an unknown variance-component parameter, we refer to r as the sparsity level 
of the signal and to the signal s as being r-sparse. Note that ||s||^o — dim(supp(s)) counts the number of 
nonzero elements in s; we refer to ||s||^o as the support size of s. Therefore, the support size ||s||4 of the 
r-sparse vector s is less than or equal to the sparsity level r. The set of unknown parameters is 

e = (s, (j2) e (2.2) 

with the parameter space 

Qr^SrX [0, +oo) (2.3a) 

where 

Sr^{seTU^: ||s||4 < r} (2.3b) 
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is the sparse signal parameter space. The marginal likelihood function of is obtained by integrating z 
out [see (|2?T]) 1: 

Py\e{y\0)=^f{y■,H s, a' H H^) (2.4a) 

where the fact that H is a proper sensing matrix ensures that H is invertible and, consequently, that 
the pdf (I2.4al) exists. For a given sparsity level r, the maximum likelihood (ML) estimate of 6 is 

^ML(r) = {sML{r),al^{r)) = aigmax Py\e{y \ 0). (2.4b) 

For any fixed s, the marginal likelihood (I2.4al) is maximized by 

d\s) = {y~H sf {H H^)~' {y-Hs)/N. (2.5) 

Therefore, maximizing (|2.4al) with respect to 6 is equivalent to first maximizing the concentrated likelihood 
function 

Py\e{y\s,d\s)) = J exp(-0.5iV) (2.6) 

y'aet[2 71 H ) 

with respect to s G S,-, yielding Sml('"), and then determining the ML estimate of cr^ by substituting Sml('") 
into (12.51) . Obtaining the exact ML estimate 0ml('") in (I2.4bl) requires a combinatorial search and is therefore 
infeasible in practice. We now present a computationally feasible iterative approach that aims at maximizing 
(I2.4al) with respect to 6 E Qr and circumvents the combinatorial search. 

A. ECME Algorithm For Known Sparsity Level r 

We treat z as the missing (unobserved) data and present an ECME algorithm for approximately finding 
the ML estimate in (|2.4bl) . assuming a fixed sparsity level r. Since the sparsity level r is assumed known, 
we simplify the notation and omit the dependence of the estimates of on r in this section and in 
Appendix |Al An ECME algorithm maximizes either the expected complete-data log-likelihood function 
(where the expectation is computed with respect to the conditional distribution of the unobserved data 
given the observed measurements) or the actual observed-data log-likelihood, see [|3T1 Ch. 5.7]. 

Assume that the parameter estimate 0^^^ = [s^^\ (cr^)*^^^) is available, where p denotes the iteration index. 
Iteration p + 1 proceeds as (see Appendix lAl for its derivation): 

• update the sparse signal estimate using the expectation-maximization (EM) step, i.e. the expectation 
(E) step: 

= E,\y^e[z I y, 6>(P)] = s^^^ + {H H^y' (y - H s^^^) (2.7a) 
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followed by the maximization (M) step, which simplifies to 

g(p+i) = argmin - s\\j = Triz^P^^'') (2.7b) 

seSr 

and 

• update the variance component estimate using the following conditional maximization (CM) step: 

= {y-Hs^P+'Y{HH^)-'{y-Hs^P+'^)/N (2.7c) 

obtained by maximizing the marginal likelihood (|2.4al ) with respect to for a fixed s = 5^^+^), see 
(|231) . 

In (|2.7a| ). E |y ^[^ | 0] denotes the mean of the pdf pz\y^0{z\y,6), which is the Bayesian minimum 
mean-square error (MMSE) estimate of z for known 6 [[33l Sec. 11.4]. Note that {H H^)~^ can be pre- 
computed before the iteration starts or well approximated by a diagonal matrix; hence, our ECME iteration 
does not require matrix inversions. See Section IV-AI for detailed discussion on the complexity of the ECME 
method. If the rows of the sensing matrix H are orthonormal: 

HH^ = In (2.8) 

then the EM step in (I2.7al) - (l2.7bl) is equivalent to one iterative hard-thresholding (IHT) step in GOl eq. 
(10)]. 

The above ECME algorithm does not satisfy the general regularity conditions assumed in standard 
convergence analysis of the EM-type algorithms in e.g. [1311 and ll32l Theorem 2]. In particular, 

• the complete-data and conditional unobserved data given the observed data distributions p^.y \0{z,y\0) 
and pz I y,9iz I y, 6) are both degenerate, see (I2.1al) and Appendix |Al 

• the parameter space 0,. is non-convex and its interior is empty; 

• in ©r, the partial derivatives of the marginal likelihood (I2.4al) with respect to the components of s do 
not exist for most directions. 

Therefore, we establish the convergence of our ECME iteration afresh in the following section. 

III. Convergence Analysis of the ECME Algorithm 

We now answer the following questions. Does the ECME iteration in Section III- A I ensure monotonically 
non-decreasing marginal likelihood (|2.4al) . does it converge to a fixed point and, if yes, is this fixed point 
a local or the global maximum of the marginal likelihood function? How do we define a local maximum 
in the parameter space 6^ in (I2.3al) ? Since the sparsity level r is fixed, we omit the dependence of the 
estimates of on r in this section and in Appendices |B] and O that contain the proofs of the results of this 
section. 
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Maximizing the concentrated likelihood function (12.61) with respect to s G Sr is equivalent to minimizing 
the weighted squared error 

S{s) = Na\s) = {y - H sf [H H^)-' {y -Hs). (3.1) 
The following identity holds for all s G 7^™ and s' G 7?.™: 

S{s) = Q{s\s') -H{s\s') (3.2a) 

where 

Q(s|s') = \\s' + H^{HH^)-\y-Hs')-s\\l (3.2b) 
nisls') = {s-s'f[Im-H^{HH^)-'H]{s-s'). (3.2c) 

This identity follows by rewriting (1120) as Q{s \ s') = \\{Im- H^iHH^)-^H){s' - s) + H^{HH^)-\y- 
Hs)\\'j^ and expanding the squares. Observe that H^s \ s') is minimized at s = s'. 

Denote by s^'^ the estimate of s obtained in Iteration p of our ECME iteration. When we set s' = s^p\ 
Q{s I s^P^) = \\z^P^^^ — s\\j^ becomes exactly the expression that is minimized in the M step (I2.7bl) and, 
consequently, 

g(s(p+i) I s(p)) < Q^giP) I s(p)). (3.3a) 
Since T-L^s] s^^'^) is minimized at s = we have 

Subtracting (I3.3al) from (I3.3bl) and using (I3.2al) yields 

^(s(P+i)) < ^(s(p)) (3.4) 

and, therefore, our ECME iteration (12.71) ensures a monotonically non-decreasing marginal likelihood (I2.4a|) . 
see also (12.61) . Monotonic convergence is also a key general property of the EM-type algorithms [|3T1l . 
Furthermore, since (|3.1I) is bounded from below by zero, the sequence 8{s'^'^^) must converge to a limit as 
the iteration index p grows to infinity. 

However, the fact that £{s^p^) converges does not necessarily imply that s^^ converges to a fixed point. 
The following theorem establishes convergence of the ECME signal iterates s^p\ 

Theorem 1: Assume that the sparsity level r satisfies 

r<lim-N) (3.5a) 
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and that the sensing matrix H satisfies the unique representation property (URP) [JJ stating that all N x N 
submatrices of H are invertible or, equivalently, that 



Then, the ECME signal iterate s^^^ for sparsity level r converges monotonically to its fixed point as the 
iteration index p grows to infinity. 



Note that (I3.5al) is a mild condition. In practice, N <^ m and (|3.5al) specifies a large range of sparsity 
levels r for which the ECME iteration converges to its fixed point. 

Theorem [T] guarantees the convergence of our ECME iteration to a fixed point. However, can we guarantee 
that this fixed point is a local or the global maximum of the marginal log-likelihood function (I2.4al) ? To 
answer this question, we first define the local maximum of a function over the parameter space in (|2.3b|) . 

Definition 1: r-local maximum and minimum. For a function /(s) : 7^™ —^TZ,a vector s* E Sr is an 
r-local maximum point of /(s) if there exists a 5 > 0, such that, for all s E Sr satisfying ||s — s^W^^ < 5, 
we have 



Then, /(s*) is the corresponding r-local maximum of /(s). We define s* E Sr and /(s*) as an r-local 
minimum point and the corresponding r-local minimum of /(s) if s* is an r-local maximum point for the 
function —f{s). 

Definition [T] states that an r-sparse vector is a r-local maximum (or minimum) point of a function f{s) 
if, in some small neighborhood, this vector attains the largest (or smallest) function value among all the 
sparse vectors within that small neighborhood. Fig. [U illustrates this concept using s = [si, ^2]^ (i.e. m = 2) 
and /(s) = exp{-0.5 [(si + 0.5)^ + (s2 - 0.7)^]}. For the sparsity level r = 1, the points a = [-0.5, 0]"^ 
and b = [0,0.7]^ are the only two 1-local maximum points of /(s). Observe that a and b are not local 
maximum points of f{s) when s is unconstrained in 7^^. 

The following lemma provides a necessary condition for an r-local maximum or minimum point of a 
differentiable function. 

Lemma 1: If an r-sparse vector s* E Sr is an r-local maximum or minimum point of a differentiable 
function /(s) : 7^™ — )■ TZ, then, for alH G {1, 2, . . . , m} such that 



spark(if) = + 1. 



(3.5b) 



Proof: See Appendix IB 



□ 



dim({z} U supp(s*)) < r 



(3.6a) 



we have 



df{s) 



0. 



(3.6b) 



dsi s=s* 
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Fig. 1. Function /(s) — exp{— 0.5 [(si + 0.5)^ + (s2 — 0.7)^]} with s = [si, 82]"^ and two 1-Iocal maxima of f{s). 

Proof: See Appendix O □ 
The condition (I3.6al) of Lemma [T] implies that, instead of checking that all partial derivatives of our 
function are zero (which is required in the standard first-derivative test for finding local maxima and 
minima), we only need to check its derivatives along a few allowed coordinate axes, where the allowed 
coordinate axes are defined by the property that perturbing along these axes does not violate the sparsity 
requirement, see (I3.6al) . If s* has exactly r nonzero elements, then i in (|3.6al) must belong to supp(s'^), 
and we should only check the r partial derivatives that correspond to the nonzero components of s*. For 
example, consider Fig.[T} to determine if a = [—0.5, 0]^ is a 1-local maximum point, we only need to check 
that the partial derivative of /(s) with respect to Si is zero at s = a; the direction along the S2 axis is not 
allowed because the perturbation along this direction violates the sparsity requirement. However, when s* 
has less than r nonzero elements, we must check all partial derivatives, because perturbing along any axis 
will not exceed the sparsity requirement. 

We now provide a sufficient condition for an r-local maximum or minimum point of a twice differentiable 
function. 

Lemma 2: An r-sparse vector s* E Sr is an r-local maximum or minimum of a twice differentiable 
function /(s) : 7^" 7^ if 
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(1) for all i G {1, 2, . . . , m} such that dim({i} U supp(s*)) < r, we have 

df{s) 



= (3.7) 

s=s* 



dsi 
and 

(2) there exists a 5 > 0, such that, for all s G Sj. satisfying \\s — s*\\i,^ < 5, the Hessian matrix 

ds ds'^ 

is negative semidefinite (for a maximum) or positive semidefinite (for a minimum). 
Proof: See Appendix O □ 
In the example depicted in Fig. [H both points a and b satisfy the two conditions of Lemma [2] and 
are therefore the r-local maxima of /(s). Lemma |2] is useful in developing the following theorem stating 
that our ECME algorithm actually converges to an r-local maximum point of the concentrated marginal 
likelihood function (12.61) . 

Theorem 2: If the sensing matrix H is proper and ECME iteration in Section III-AI converges to a fixed 
point 6* = (s*, (cr^)*), then s* is an r-local maximum point of the concentrated marginal likelihood function 
(12:61) . 

Proof: See Appendix O □ 
Based on Theorems \T\ and [2l we claim that, if H satisfies the URP condition (I3.5bl) and for a sufficiently 
small sparsity level r, the ECME algorithm in Section III-AI converges to a fixed point that is an r-local 
maximum of the concentrated marginal likelihood function (12.61 ). 

The conditions of Theorems [T] and [2] hold even when the sensing matrix H is pre- or post- multiplied 
by a full rank square matrix. In contrast, the IHT algorithm converges to a local minimum of the squared 
residual error for a specified sparsity level only if H is appropriately scaled. Indeed, Theorem 4 in lfT9l 
demands that the spectral norm of the sensing matrix H is less than unity. If the spectral norm condition 
is violated, the IHT iteration may become unstable and diverge, see [|2T1 Sec. II-D]. To overcome such 
scaling requirements and ensure convergence for an arbitrary scaled H, a normalized IHT (NIHT) method 
has been proposed in ll2Tll . where a scaling term is introduced to the original hard thresholding step; this 
term must be monitored and adjusted in each iteration so that it does not exceed a certain threshold (see 
[ j2Tl e.q. 14]); otherwise, the squared residual error [19, eq. (1.6)] is not guaranteed to decrease during 
the iteration. However, this monitoring and adjustment consume CPU time and typically slow down the 
resulting algorithm, see the numerical examples in Section IVIII In contrast. Theorems [T] and [2] assert that 
the monotonic convergence of our ECME iteration is not affected by the pre- and post-multiplication of H 
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by any full-rank square matrix of appropriate size, thus removing the need for monitoring and adjustment 
within the iteration steps. 

IV. Sparse Subspace Quotient and Near-optimal ECME Reconstruction 

We now study theoretical guarantees for near-optimal ECME reconstruction. We first define the r-sparse 
subspace quotient (r-SSQ) as a normalized squared magnitude of the projection of an r-sparse signal onto 
the row space of the sensing matrix. We introduce minimum r-sparse subspace quotient of the sensing 
matrix as a separability measure for arbitrary r-sparse signals, discuss its properties, compare it with the 
existing popular measures such as restricted isometry and coherence, and use it to establish a condition 
for uniqueness of the solution to the (Pq) problem. We then show that, in the absence of noise and if 
the minimum 2r-sparse subspace quotient is sufficiently large, our ECME algorithm estimates the true 
unknown r-sparse signal perfectly from the linear measurements. We also give an example of the existence 
of low-dimensional matrices that satisfy our perfect recovery requirement (Section [I V-AI) . We finally show 
that, for non-sparse signals and noisy measurements, the ECME iteration for the sparsity level r recovers 
the best r-term approximation of the true signal within a bounded error. Both the noiseless and noisy 
reconstruction guarantees hold regardless of the initial estimate of the signal parameters 6 employed by the 
ECME iteration. 

Definition 2: r-Sparse Subspace Quotient (r-SSQ) and minimum r-SSQ. We define the r-sparse 
subspace quotient of a nonzero r-sparse vector s of size m x 1 (i.e. s E Sr\Omxi) and a proper N x m 
sensing matrix H as the ratio of the squared magnitude of the projection of s onto the row space of H 
and the squared magnitude of s: 

Pr{s,H) = ^ = '——^ . (4.1a) 

Define the corresponding minimum r-sparse subspace quotient of the sensing matrix H as 

Pr,mmiH) = mm pr{s,H). (4.1b) 

m X 1 

Note that (HH^)~^ H is the projection matrix onto the row space of H and the second equality in 
(I4.1al) follows from the fact that the projection matrix is idempotent. 

The following lemma summarizes a few useful properties of r-SSQ and minimum r-SSQ. 

Lemma 3: For an N x m proper sensing matrix H, a nonzero r-sparse vector s of size m x 1, and a 
sparsity level r satisfying < r < m, 

(a) pr{s,H) in (I4.1al) can be equivalently defined as 

„.fs,„^^^mifrlH^ ,4.2a) 
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where A = supp(s) is the support set of s and Pr,mm{H) can be determined by the following 
equivalent optimization: 

PrMniH) = min A^in (ffj {H H^y' H^) ; (4.2b) 

ylC{l,2,...,m},dim(A)=r V ^ ^ / 

(b) pr{s,H) and Pr,mm{H) are invariant to invertible linear transforms of the rows of H, i.e. 

Pr{s,H) = Pr{s,GH), PrMn{H) = PrMn{G H) (4.3) 

for any full-rank N x N matrix G; 

(c) pr{s,H) and pr,mm{H) are bounded as follows: 

0<Pr,mUH)<Pr{s,H)<l (4.4) 

where pr,mm{H) attains 

• the lower bound when r > N and 

• the upper bound 1 when N = m; 

(d) if and only if H has at least r linearly independent columns, i.e. 

spark(if) > r (4.5) 

the following strict inequality holds: 

(H) > 0; (4.6) 

(e) if < ri < r2, then 

Pri,min 

{H)> 

Pr2 ,min 

(H). (4.7) 

Proof: See Appendix iDl □ 
We now compare minimum r-SSQ with the commonly used restricted isometry property (RIP) [lH, flVl, 
[HD, [ini, El, [l20l, [im and coherence (1, ttEl, [[l6l, [[IHl, [l37l, [IMl. The idea behind RIP is to upper- 
bound deviations of the squared magnitude of H s from the squared magnitude of s for arbitrary nonzero 
r-sparse vectors s; therefore, the following quotient should be close to unity for arbitrary nonzero r-sparse 

mslll sr„r„s ^^^^^ 



The restricted isometry constant (RIC) for sparsity level r can be written as (see [jVl e.q. (1.7)]): 



7r(-ff) = max 

SgiSr\Omxl 



1 



\Hs 



|2 



\s\\l 



max 

m X 1 



1 - 



s 



(4.9) 



which quantifies the largest-magnitude deviation of (|4.8I) from unity. Clearly, the smaller the r-RIC is, the 
closer to orthonormal any r columns of H are. The assumption that the appropriate RIC is sufficiently small 
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is key for sparse-signal recovery analyses of the IHT algorithms |[20l . ETTl . CoSaMP IfTTl . and convex 
relaxation methods [|7]|, [fTTIl . lfT2ll . The coherence measures the largest-magnitude inner product of any two 
distinct columns of H. The assumption that the coherence is small is a basis for sparse-signal recovery 
analyses of convex relaxation methods in [38 1 and of greedy methods (such as OMP) in [|T5l . However, both 
the RIP and coherence requirements are somewhat fragile: a simple linear transform or even a scaling of H 
by a constant can easily break the equilibria required by the RIP or coherence. In comparison, the minimum 
r-SSQ in (|4.1bl) measures the smallest normalized squared magnitude of the projection of an r-sparse signal 
onto the row space of the sensing matrix H. Here, it is the row space of H that matters, rather than H 
itself. Lemma [3] (b) states that the minimum r-SSQ is invariant to invertible linear transforms of the rows 
of H. Therefore, the sensing matrix H can be pre-multiplied by any N x N full-rank matrix|l] leading to 
arbitrary column magnitudes and highly correlated columns, while still keeping the same minimum r-SSQ 
value. Hence, minimum r-SSQ is a more flexible property of H than RIP and coherence. 

We now utilize the minimum SSQ measure to establish a condition under which the solution to the (Pq) 
problem is unique and leads to exact recovery under the noiseless scenario. A similar problem is considered 
in [|71 Lemma 1.2] and (H Theorem 2], where such uniqueness and exact recovery conditions have been 
derived using RIP (|4.9I) and spark. 

Lemma 4: Suppose that we have collected a measurement vector y = Hs'^ using a proper sensing matrix 
H, where is a sparse signal vector having exactly || s^||^g = r* nonzero elements. If the minimum 2r*-SSQ 
of the sensing matrix H is strictly positive: 

P2r'>Mn{H) > (4.10) 

then the solution to the (Pq) problem (|l.ll) is unique and coincides with s*. 

Proof: See Appendix O □ 
Observe that the condition (14.101) implies that the number of measurements is larger than or equal to 
twice the support size of the true sparse signal s^, i.e. 

Ar>2r* = 2||s*||^„. (4.11) 

Indeed, if A^ < 2r*, p2ro,mmiH) = by part (c) of Lemma [3l 

Lemma |4] also holds if we replace r* in the condition (14.101) with any r > r*, which follows from part 
(e) of Lemma m if P2r,mm{H) > for < r, then p2ro,mmiH) > P2r,mmiH) > 0. Therefore, (I4.10|) is the 
weakest condition on H among all r > r'^. 

'Unlike the ECME convergence analysis in Section [nil invertible linear transforms of the columns of H are generally not allowed here. 
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In a nutshell, Lemma |4] states that, for a strictly positive p2r,mm{H), any two distinct r-sparse vectors 
can be distinguished from their projections onto the row space of H, which furthermore guarantees the 
uniqueness of the (Pq) problem. Note that [7, Lemma 1.2] states that the solution to the (Pq) problem (11.11 ) 
is unique and coincides with if the 2r^-RIC of the sensing matrix H satisfies 

72ro(^) < 1. (4.12) 

However, for proper sensing matrices, the condition (14.101) of Lemma|4]is weaker than (14.121) : (14.121) implies 
that spark(i7) > 2 [see (I4.9|) 1 and, consequently, (|4.10l) . but not vice versa. For example, the 2 x 3 sensing 
matrix 

(4.13, 



satisfies the condition (14.101) with p2,min{H) = 1/3 > 0, but violates (|4.12l) . since its 2-RIC is ^2{H) = 
1.618 > 1. Hence, (14.101) does not imply (14.121) . 

We now develop reconstruction performance guarantees for our ECME algorithms that employ the 
minimum r-SSQ measure. 

Theorem 3: Exact Sparse Signal Reconstruction From Noiseless Samples. Suppose that we have 
collected a measurement vector 

y = Hs'' (4.14a) 

where G Sr is an r-sparse signal vector, i.e. < r. If the minimum 2r-SSQ of the sensing matrix 

H satisfies 

P2rMniH) > 0.5 (4.14b) 

then the ECME iteration for the sparsity level r in Section III-AI converges to the ML estimate of 0: 

^ML(r) = (s^0) (4.14c) 

and therefore recovers the true sparse signal perfectly. 

Proof: See Appendix iDl □ 
Theorem [3] shows that, upon convergence and if the minimum 2r-SSQ of the sensing matrix is sufficiently 
large, the ECME algorithm recovers the true sparse signal perfectly from the noiseless measurements. 
In this case, the ECME iteration converges to the global maximum of the marginal likelihood (I2.4al) . which 
is infinitely large since the ML estimate of is zero. This global convergence is guaranteed regardless of 
the initial estimate of 6 used to start the ECME iteration. In addition, by Lemma |4l is also the unique 
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solution to the (Pq) problem. Therefore, under the conditions of Theorem [3l the ECME algorithm solves 
the (Pq) problem as well. 

Next, we consider a more practical scenario where the true signal is not strictly sparse and the 
measurements y are corrupted by noise. 

Theorem 4: Near-Optimal Recovery of Non-sparse Signal From Noisy Samples. Suppose that we have 
collected a measurement vector 

y = Hs'' + n (4.15a) 

where the signal is not necessarily sparse and n G TZ^ is a noise vector. Denote by the best r-term 
£2-norm approximation to s^, i.e. 

= argmin ||s - = 7;(s*) (4.15b) 

and by s* the r-sparse signal estimate obtained upon convergence of the ECME iteration for the sparsity 
level r in Section III-AI If the minimum 2r-SSQ of the sensing matrix H satisfies 

P2r-,min(i^) > 0.5 (4.15c) 

which is the same as the condition (I4.14bl) in Theorem [3l then 

V P2r,minl,-n ) — \/ I — P2r,min(,-n ) 

Proof: See Appendix |Dl □ 
Theorem m shows that, for a general (not necessarily sparse) signal and noisy measurements satisfying 
(|4.15al) and sensing matrix satisfying (|4.15cl) . the ECME estimate is close to the best r-term ^2-norm 
approximation of s^. This result holds regardless of the initial estimate of 6 employed by the ECME 
iteration. Observe that, by (14.41) . p2r,mm{H) < 1 and therefore the squared roots in (I4.15dl) are well-defined. 
Moreover, since (|4.15cl ) holds, the denominator on the right-hand side of (|4.15d|) is positive and less than 
or equal to one. When the noise n is zero and signal is r-sparse, the quantities ||s^ — s*||^2 
\\H'^ {HH'^)^^n\\£^ in (I4.15dl) are zero and, therefore, ||s* — s^\\i^ = 0, consistent with Theorem |3l 

Performance guarantees similar to those in Theorems [3] and |4] have been developed for other sparse 
reconstruction methods. However, these results rely on either small RIP constants (see e.g. [7, Theorems 
1.3, 1.4], [dll Theorem 1], [HI Theorem 1.1] and EtI Theorem A], ^ Theorems 4, 5], UU Theorem 
4]) or small coherence (see e.g. [f38l Theorem 2] and lfT5l Theorem 3.5]). Therefore, all previous results 
require that a certain numbers of columns of the sensing matrix H are approximately orthonormal (RIP) 
or orthogonal (coherence). In contrast, our analysis of the ECME method in Theorems [3] and |4] applies 
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to the cases where the columns of H are not approximately orthonormal and can be heavily correlated, 
thus widening the class of sensing matrices for which it is possible to derive reconstruction performance 
guarantees, see also the discussion after Lemma [3l 

A. An Example of a Low -dimensional Matrix Satisfying The Conditions of Theorems \3\ and E] 

The ongoing search for desirable sensing matrices focuses on small RIP constants and on asymptotic 
behavior of large random matrices, e.g. Gaussian, Bernoulli (with entries equal to 1 and —1), and Fourier 
(randomly selected rows of the DFT matrix) matrices, see e.g. [|7]| and [[39l . We now show that it is possible 
to find low-dimensional sensing matrices that satisfy the condition p2r,mmiH) > 0.5 of Theorems [3] and IH 
Consider the 21 x 32 sensing matrix H comprised of the 21 rows of the 32 x 32 type-II discrete cosine 
transform (DCT) matrix (see e.g. BUI Sec. 8.8.2]) with indices 

2, 3, 4, 5, 7, 9, 10, 12, 13, 14, 16, 18, 20, 21, 22, 24, 27, 29, 30, 31, 32. (4.16) 

It can be verified by combinatorial search that the minimum 2-SSQ of H meets the condition (|4.14b|) : 

P2,min(i^) = 0.503 > 0.5 (4.17) 

and, therefore, by Theorem [3l for this H, our ECME iteration perfectly recovers any 1 -sparse signal s from 
the 21 noiseless linear measurements given hy y = Hs. (We have checked and confirmed the validity of 
this statement via numerical simulations.) However, the 2-RIC of the same 21 x 32 sensing matrix H is 

72(i^) = 0.497. (4.18) 

which violates the condition required in the theoretical analysis of the IHT algorithm [|20l Theorems 4 and 
5]. In particular. Theorems 4 and 5 in EOl require 'jsiH) < l/\/32 ^ 0.177 for 1-sparse signals, but here 
73 (-f^) > 12{H) = 0.497. We have checked that the above sensing matrix H also violates the condition 
required in the theoretical analysis of the NIHT algorithm ETl Theorem 4]. Indeed, for 1-sparse signals 
and the above sensing matrix H, the non-symmetric restricted isometry constant in [[2T| is at least 0.611, 
which is larger than the upper limit 0.125, see [|2T1 Theorems 4]. 

Due to the invariance property of Lemma [3] (b), any invertible linear transformation of the rows of H 
preserves the minimum r-SSQ constant. Therefore, upon finding one good sensing matrix H that satisfies 
p2r,mm{H) > 0.5, wc Can construct infinitely many matrices that satisfy this condition. 

V. The DORE Algorithm for Known r 

We now present the DORE thresholding method that accelerates the convergence of our ECME iteration. 
Since the sparsity level r is assumed known, we omit the dependence of the estimates of on r in this 
section. 
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Assume that two consecutive estimates of the unknown parameters 6^p~^^ = (a^)*^^^^^) and 
Qip) _ (^s(p)^ (a'^)^P^) are available from the {p — l)-th and p-th iterations, respectively. Iteration p + 1 
proceeds as follows: 

1. ECME step. Compute 

s = 7;(s(^') + H^{HH^)-\y - Hs^^^)) (5.1a) 

d^ = {y- Hsf {HH^)-^{y- Hs)/N (5.1b) 

and define = (s, cr^). 

2. First overrelaxation. Compute the linear combination of s and s*^*'^: 

z = s + ai{s- s^P^) (5.2a) 

where the weight 

{Hs- Hs^pY {HH^y^iy- Hs) 

is the closed-form solution of the line search: 

ai = aj:gmaxpyig(y\ {s + a{s- s^^)),o-^)) (5.2c) 

a 

with the parameter space of 6 extended to 6^, where ri = dim(supp(s) U supp(s*^P))) is the sparsity level 
of s + « (s — s^P^) and cr^ is an arbitrary positive number, see also (I2.4al) . 

3. Second overrelaxation. Compute the linear combination of z and s^^^^^: 

z = z + a2 (z - s^P-^'^) (5.3a) 

where the weight 

^ {H z - H s^P'^)f {H H^)-^ jy-Hz) 

{H z - H s(p-^)y {H H^y^H z - H s(p-^)) ^ ^ 

is the closed-form solution of the line search: 

a2 = arg max py\e(y \{z + a{z - s^^^^^), a^)) (5.3c) 

with the parameter space of 6 extended to 6r-2, where r2 = dim(supp(z) U supp(s'^^'"^^)) is the sparsity 
level of z + « (z — s^p^^^) and cr^ is an arbitrary positive number. 

4. Thresholding. Threshold z to the sparsity level r: 

s = %.{z) (5.4a) 
compute the corresponding variance component estimate: 

= {y-Hs)'^{HH^)-yy-Hl}/N (5.4b) 
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and define our final overrelaxation parameter estimate 6 = (s.a'^). 

5. Decision (between ECME and thresholded overrelaxation parameter estimates). \f py\e{y\6) > 
Py\g{y\6) or, equivalently, if 

< (5.5) 

assign 0^^^^^ = 6; otherwise, assign 0^^^^^ = 6 and complete Iteration p+1. 

Iterate until two consecutive sparse-signal estimates s^'^'^ and s(p+^) do not differ significantly. Since 
{H H^)^^ can be pre-computed, our DORE iteration does not require matrix inversion; the line searches 
in the two overrelaxation steps have closed-form solutions and are therefore computationally efficient, see 
Section IV-AI for details on computational complexity. 

If the rows of the sensing matrix H are orthonormal [i.e. (|2.8I) holds]. Step 1 of the DORE scheme 
reduces to one IHT step. After Step 1, we apply two overrelaxations (Steps 2 and 3) that utilize the sparse 
signal estimates s*^^^ and s^^^^^ from the two most recent completed DORE iterations. The goal of the 
overrelaxation steps is to boost the marginal likelihood (I2.4al) and accelerate the convergence of the ECME 
iteration. Using a single overrelaxation step based on the most recent parameter estimate is a common 
approach for accelerating fixed-point iterations, see [28]. Here, we adopt the idea in ||28l Sect. 5.1] and 
apply the second overrelaxation, which mitigates the 'zigzagging' effect caused by the first overrelaxation 
and thereby converges more rapidly. Our algorithm differs from that in [28, Sect. 5.1], which focuses on 
continuous parameter spaces with marginal likelihood that is differentiable with respect to the parameters. 
Unlike [IM Sect. 5.1], here we 

• apply overrelaxation steps on parameter spaces with variable dimensions (Steps 2 and 3), 

• threshold the second overrelaxation estimate to ensure that the resulting signal estimate is r-sparse 
(Step 4), and 

• test the thresholded estimate from Step 4 versus the ECME estimate from Step 1 and adopt the better 
of the two (Step 5). 

Step 5 ensures that the resulting new parameter estimate 0^'p^^'> yields the marginal likelihood function (|2.4al) 
that is higher than or equal to that of the standard ECME step (Step 1). Therefore, the DORE iteration 
(I5.1|) - (I5-5|) ensures monotonically nondecreasing marginal likelihood between consecutive iteration steps: 

Py\e{y\e^'^'^)>Py\e{y\e^^). (5.6) 

Furthermore, under the conditions of Theorem [H the DORE iteration converges to a fixed point of the 
ECME iteration. This convergence result follows from the facts that each DORE iteration contains an 
ECME step and yields the marginal likelihood (I2.4al) that is higher than or equal to that of the standard 
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ECME step. To see this, consider two consecutive DORE signal estimates s^p^ and and the ECME 

estimate s in Iteration p + I. Due to Step 5 of the DORE scheme, we have 



where A = supp(s^P^) U supp(s). (|5.7b| ) follows from (IB .61) of the proof of Theorem [T] in Appendix |Bl 
Since the sequence £^(s(p)) — converges to zero and the conditions of Theorem [T] ensure that the 

term 1 — Ainax(-f^l (H H^)^^ Ha) in (I5.7bl) is strictly positive (see the proof of Theorem [T] in Appendix 
E]), II s — s^^'^lll^ converges to zero as well, implying the convergence of the DORE iteration to an ECME 
fixed point. In addition, by Theorem [21 the fixed point that DORE converges to is also a local maximum of 
the concentrated marginal likelihood function (12.61) . The near-optimal recovery results (Theorems [3] and H]) 
in Section |IV] also apply to DORE and can be easily derived along the lines of the proofs for the ECME 
algorithm in Appendix |D] using the facts that each DORE iteration contains an ECME step and yields the 
marginal likelihood (I2.4al) that is higher than or equal to that of the standard ECME step. 

DORE Initialization. The parameter estimates 0^^^ and 0^'^^ are obtained by applying two consecutive 
ECME steps (12.71) to an initial sparse signal estimate s^'^\ 

A. Computational Complexity and Memory Requirements 

The major computational complexity of the ECME algorithm lies in the matrix-vector multiplications and 
sorting of m x 1 vectors. Assuming that the common bubble sorting is employed, sorting 2;^^+^) in (I2.7bl) 
requires (9(m^) operations. There are three matrix- vector multiplications in one ECME iteration, namely 
Hs'^p\ {HH^Y^Hs^'P'^] and H^[{HH^)-^Hs^p\ which requires 0{Nm), 0{N^) and 0{Nm) operations, 
respectively. The intermediate computation results of [a'^Y^^^^ in (|2.7cl) can be stored and used to compute 
(|2.7bl) of the next iteration; therefore, this step does not cause additional computation. In summary, the 
complexity of one ECME iteration is C(m^ + 2Nm + N"^). If H has orthonormal rows satisfying (12.81) . 
ECME reduces to the IHT iteration, and in this case {H H'^)^^[H s^p"*] is simply Hs^p\ The computation 
complexity of one IHT step is therefore C(m^ + 2Nm). 

For DORE, there are two sorting operations per iteration, one in step 1 and the other in step 4, 
requiring 0{2m?) operations. In one DORE iteration, we need to compute H^[(HH^)~^Hs''P^], Hs, 
{H H'^)-^ \Hs\, Hs, and {H H'^y^ \Hs\, which require total of 0{?>Nm + 2N'^) operations. Note that 
Hs^P-^\ {HH^)-\Hs^P-% Hs^P\ and {H H^)-\H s^p'^] in (l5Jal) . (IS^Ibl) and (l53bl) can be adopted 
from the previous two iterations and do not need to be computed again in the current iteration; in addi- 
tion, the quantities Hz and {HH'^y^[Hz] in (I5.3bl) are simple linear combinations of computed terms 



(5.7a) 



> [1-K,^,{HI{HH^)-^Ha)]\\s-s 




(5.7b) 
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and do not require additional matrix- vector computations. To summarize, one DORE iteration requires 
(9(2m^ + ?)Nm + 2N'^), which is slightly less than twice the complexity of one ECME step. When H 
has orthonormal rows, we do not need to compute {H H^)^^ W s] and (H H^)^^ [Hs\, which brings the 
complexity down to 0{2m? + 3Nm), slightly less than twice the complexity of one IHT step. 

Regarding the memory storage, the largest quantity that ECME (and its special case IHT) and DORE need 
to store is the sensing matrix H requiring memory storage of order 0{Nm). In large-scale applications, 
H is typically not explicitly stored but instead appears in the function-handle form [for example, random 
DFT sensing matrix can be implemented via the fast Fourier transform (EFT)]. In this case, the storage 
requirement of ECME, IHT and DORE is just 0(m). 

Although a single DORE step is about twice more complex than the ECME and IHT steps, it converges 
in much fewer iterations than the ECME and IHT iterations in the numerical examples in Section IVII[ see 
Fig.[3](b) and Fig.|4](c). 

VI. Unconstrained Sparsity Selection Criterion for Selecting r and the ADORE 

Algorithm 

The ECME and DORE algorithms, as well as most other greedy methods, require the knowledge of sparsity 
level r as an input. In this section, we propose an sparsity selection criterion and an automatic double 
over relaxation (ADORE) thresholding algorithm that estimates the signal sparsity from the measurements. 

We introduce the following unconstrained sparsity selection (USS) objective function for selecting the 
proper sparsity level r that strikes a balance between the efficiency and accuracy of signal representation: 

USS(.) = 4 . .„ (^) - 1 (iV - . - 2) in (6.1) 

where S'ml(^) the ML estimate of the variance component in the parameter space 6,., see (|2.4b|) . 
USS(r) in (16.11) is developed from the approximate generalized maximum likelihood (GML) objective 
function in [|26l e.q. (13)]; in particular, when (H H^)^^ y/N = 1, the two functions are equal up to 
an additive constant. However, unlike GML, the USS objective function (16.11 ) is scale-invariant: scaling the 
measurements y by a nonzero constant does not change USS(r), which is a desirable property. 

Interestingly, the USS objective (16.11) is closely related to the (Pq) problem (|l.ll) . as shown by the 
following theorem. 

Theorem 5: Suppose that we have collected a measurement vector y = H using a proper sensing 
matrix H, where is a sparse signal vector having exactly = ||s^||^o nonzero elements. If 

(1) the sensing matrix H satisfies the unique representation property (URP) condition (|3.5bl) and 

(2) the number of measurements satisfies 

iV > max{2 + 3} (6.2) 
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then 

• USS(r) in (16.11) is globally and uniquely maximized at r = and 

• the (Po)-optimal solution and ML sparse signal estimate at r = [i.e. Sml('"*), see (I2.4bl) 1 are both 
unique and coincide with s^. 

Proof: See Appendix |El □ 
Theorem [5] shows that the USS objective function transforms the constrained optimization problem (Pq) 

in (11.11) into an equivalent unconstrained problem (16.11) and that USS optimally selects the signal sparsity 

level r that allows accurate signal representation with as few nonzero signal elements as possible. 

In the practical scenarios where > 3, condition (2) of Theorem |5] reduces to > 2 r*^, which is the 

condition required to ensure the uniqueness of the (Pq) problem, see dH Theorem 2]. 

In the following, we use DORE to approximately evaluate the USS objective function and apply this 

approximate USS criterion to automatically select the signal sparsity level. 

A. The ADORE Algorithm for Unknown Sparsity Level r 

We approximate the USS objective function (16.11) by replacing the computationally intractable ML estimate 
a^jj^(r) with its DORE estimate. Maximizing this approximate USS objective function with respect to r by 
an exhaustive search may be computationally expensive because we need to apply a full DORE iteration for 
each sparsity level r in the set of integers between and A^/2y Here, we propose the ADORE algorithm that 
applies the golden-section search [|34[ Sec. 4.5.2.1] to maximize the approximate USS objective function 
with respect to r, with the initial search boundaries set to and [A^/2]. Note that USS(O) = assuming 
that y 7^ Oatxi, which is of practical interest. For each candidate < r < [A^/2], we estimate al^^^{r) 
using the DORE iteration. After running one golden sectioning step, the length of the new search interval is 
approximately 0.618 of the previous interval (rounded to the closest integer). The search process ceases when 
the desired resolution L is reached, i.e. when the searching interval becomes shorter than the prescribed 
resolution level L. Therefore, ADORE requires roughly 1.4 [log2(A^/L) — 1] full DORE iterations. For the 
golden-section search to find the exact maximum of (|6.1I) . USS(r) must be unimodal in r, which is not 
true in general. Hence, ADORE maximizes (|6.1I) only approximately, yielding r adore; then, our ADORE 
sparse-signal estimate is equal to the corresponding DORE estimate at r = tadore- 

VII. Numerical Examples 

We now compare our proposed methods in Sections |V] and |Vl] with existing large-scale sparse recon- 
struction techniques using two image recovery experiments, with purely and approximately sparse signals, 

^Note that N/2 is the largest value of the sparsity level r for which reasonable reconstruction is possible from A'^ measurements; otherwise, 
the (Po) and ML estimates of the sparse signal may not be unique, see e.g. (8. Theorem 2], 
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respectively. In particular, we compare 

• the DORE and ADORE schemes initialized by the zero sparse signal estimate: 

S^'^ = Omxl (7.1) 



with ADORE search resolution set to L = 500 and Matlab implementations available at http://home.eng.iastate. 

• the IHT and NIHT schemes in [|20ll and [|2T]| . initialized by the zero sparse signal estimate s^^^ in (17.11 ); 

• the automatic hard thresholding (AHT) method in [|26ll using the moving-average window length 100, 
initialized with 2init = Omxi and rinit = 1; 

• the debiased gradient-projection for sparse reconstruction method in [fT3l Sec. III.B] with the conver- 
gence threshold tolP = 10"^ and regularization parameter set to 

(i) r = 0.1 \\H^y\\e^, suggested in ^ e.q. (22)] (labeled GPSRq) and 

(ii) r = 0.001 obtained by manual tuning for good performance in the following two 
numerical examples (labeled GPSR); 

• the minimum-norm signal estimate (labeled MN): 

s^^ = H^{HH^r'y {12) 

which achieves zero squared residual error by ignoring sparsity. 
For the DORE, ADORE, IHT and NIHT iterations, we use the following convergence criteriory: 

III 10-14. (7.3) 

The sensing matrix H has the following structure (see e.g. [5, eq. (2) and Fig. 1]): 

H = (7.4) 

where <P is an N x m sampling matrix and 'i' is an appropriate m x m orthogonal sparsifying transform 
matrix. In our examples presented here, are inverse discrete wavelet transform (DWT) matrices 1*411. 

For an underlying image ^ s, the signal vector s is the wavelet coefficient vector of the image. Our 
performance metric is the peak signal-to-noise ratio (PSNR) of a reconstructed image iZ^ s, where s is the 
estimated wavelet coefficients vector: 

[(!?'s)max - (!^s)min]^1 ( [{^ s)max - S'' 12 



PSNR (dB) = 10 logio 11,;^ ,r/i|2 = 10 logio / (7-5) 

I \\W s — W s\\ljm ) I ||s — s|||^/m 

where ( iZ^ s)min and ( W s)max denote the smallest and largest elements of 'i' s. 



To implement the IHT and NIHT schemes, we incorporated the convergence criterion l l7.3b into the corresponding MATLAB codes from 
the sparsify toolbox at http://www.see.ed.ac.uk/~tblumens/sparsify/sparsify.html 
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(a) (b) (c) 




Fig. 2. (a) The size-256^ Shepp-Logan phantom, (b) a star-shaped sampUng domain in the frequency plane containing 44 radial lines, and 
(c) the filtered back-projection (minimum-norm) reconstruction, (d) GPSRo reconstruction, (e) GPSR reconstruction, and (f) the almost perfect 
reconstruction achieved by all hard-thresholding schemes, for the sampling pattern in (b). 

A. Tomographic Image Reconstruction 

Consider the reconstruction of the Shepp-Logan phantom of size m = 256^ in Fig. [2] (a) from tomographic 
projections. The elements of y are 2-D discrete Fourier transform (DFT) coefficients of the phantom 
sampled over a star-shaped domain, as illustrated in Fig. [2] (b); see also [[2T|. and ll26l . Therefore, 
the sampling matrix <P is constructed using selected rows of the DFT matrix that yield the corresponding 
DFT coefficients of the phantom image within the star-shaped domain. In this example, we select the inverse 
Haar (Daubechies-2) DWT matrix to be the orthogonal sparsifying transform matrix The Haar wavelet 
transform coefficients of the phantom image in Fig. [2] (a) are sparse, with ||s||^f, = 3769 ~ 0.06 m, where 
the true signal vector s consists of the Haar wavelet transform coefficients of the phantom in Fig. |2] (a). 

For our choices of ^ and iZ^, the rows of H are orthonormal, i.e. (12.81) holds, implying that IHT is 
equivalent to the ECME iteration in Section III-AI DORE, IHT, and NIHT require knowledge of the signal 
sparsity level r; in this example, we set r to the true signal support size: 

r = 3769. (7.6) 

In contrast, the ADORE and AHT methods are automatic and estimate r from the measurements using the 
USS and GML selection criteria, respectively. 

Figs. [2] (c)-(f) show the images reconstructed by the above methods using the 44 radial-line sampling 
pattern in Fig. |2] (b), which corresponds to N/m = 0.163. In this example, the MN signal estimate (|7.2I) 
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Fig. 3. (a) PSNR, (b) number of iterations, and (c) CPU time as the functions of the normalized number of measurements N/m for phantom 
image reconstruction. 



is also the filtered back-projection estimate obtained by setting the unobserved DFT coefficients to zero 
and taking the inverse DFT, see [2J. Here, all hard-thresholding methods (DORE, IHT, NIHT, ADORE, 
and AHT) achieve almost perfect reconstructions of the original phantom image with PSNRs over 100 dB, 
in contrast with the MN (filtered back-projection) and GPSR methods that achieve inferior reconstructions 
with PSNRs 20.2 dB for the MN, 33.0 dB for GPSR, and 17.9 dB for GPSRq estimates. 

Fig. [3] shows (a) the PSNRs, (b) numbers of iterations, and (c) CPU times of the above methods as we 
change N/m by varying the number of radial lines in our star-shaped partial Fourier sampling pattern. In this 
example, all hard-thresholding methods have significantly sharper phase transitions than the manually tuned 
GPSR, and outperform GPSR after the phase transitions. AHT exhibits the phase transition at N/m ^ 0.15; 
the phase transitions of the other hard thresholding methods occur at N/m ^ 0.16. ADORE performs as 
well as the DORE, IHT, and NIHT methods that require prior knowledge of the signal sparsity level. 
Indeed, the USS criterion accurately selects the signal sparsity level in this case, which is consistent with 
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the essence of Theorem [51 Among all hard-thresholding methods, DORE needs the smallest number of 
iterations to converge and is also the fastest in terms of the CPU time. DORE needs 4.4 to 10.9 times less 
iterations than IHT and 2.3 to 6 times less iterations than NIHT; in terms of the CPU time, DORE is 2.7 
to 6.7 times faster than IHT and 3.6 to 16.3 times faster than NIHT. The CPU times of DORE, IHT, and 
ADORE per iteration are approximately constant as N/m varies. One DORE step is about twice slower 
than one IHT step, validating the computational complexity analysis in Section |V-A[ 

We now compare the two automatic thresholding methods (AHT and ADORE) in this example: ADORE 
requires 3.7 to 7.7 times less iterations and is 2.2 to 4.6 times faster than AHT. We note that AHT's 
computational complexity does not scale well with the increase of the signal size and its support, which is 
the case considered in the following example, where we increase the signal size four times, to m = 512^. 

B. Lena Reconstruction From Compressive Samples 

We now reconstruct the standard Lena image of size m = 512^ in Fig. |4] (a) from compressive samples. 
In this example, we select the structurally random sampling matrices <P proposed in [[42| and the inverse 
Daubechies-6 DWT matrix to be the orthogonal sparsifying transform matrix W . The wavelet coefficients 
of the Lena image are only approximately sparse. If we have a parameter estimate 6{r) = (s(r), a^(r)), 
we can construct the following empirical Bayesian estimate of the missing data vector z: 

E,\y,g[z I y,d{r)] = s{r) + H^{HH^r'[y - Hs{r)]. (7.7) 

Unlike s(r), the empirical Bayesian estimate (17.71) is not r-sparse in general, and is therefore preferable 
for reconstructing approximately sparse signals that have many small-magnitude signal coefficients. 

For our choices of <P and \P, the rows of H are orthonormal, i.e. (12.81) holds, and IHT is equivalent to 
the ECME iteration in Section III-AI For all hard thresholding methods, we apply the empirical Bayesian 
estimate (17.71) . with 6{r) equal to the parameter estimates obtained upon their convergence. We chose the 
sparsity level 

r = 10000 ^ 0.038 m (7.8) 

to implement the DORE, IHT, and NIHT iterations. 

Figs, m (b)-(d) show the PSNRs, numbers of iterations, and CPU times of various methods as functions 
of the normalized number of measurements (subsampling factor) N/m. Here, we do not include the AHT 
and MN estimates in the simulation results because, in this example, AHT is very slow compared with 
the other approaches and does not outperform them in terms of reconstruction performance, and the MN 
estimates (17.21) are poor; indeed, the PSNRs of the minimum-norm estimates vary between 14.21 dB and 
16.24 dB for the range of N/m in Figs. H (b)-(d). 
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N/m N/m 

Fig. 4. (a) The 512 x 512 Lena image, and (b) PSNR, (c) number of iterations, and (d) CPU time as the functions of the normahzed number 
of measurements N/m for Lena image reconstruction. 



Unlike the phantom reconstruction example in Section IVII-A[ here the underlying signal (the vector of 
the wavelet coefficients of the Lena image) is not strictly sparse and, consequently, 

• the difference in reconstruction accuracy between the manually tuned convex GPSR method and hard 
thresholding methods is significantly smaller: compare Fig. |4] (b) with Fig. [3] (a); 

• the achieved PSNRs of all methods are significantly smaller as well, even though the subsampling 
factor N/m ranges over a fairly wide interval, between 0.2 and 0.5. 

The importance of tuning the GPSR's regularization parameter r is evident from Fig. H] (b): GPSRq 
reconstructs the signal poorly compared with the tuned GPSR. We point out that it is not known how to 
manually tune r in practical cases where the ground-truth image in Fig. |4] (a) is not available. In contrast, 
our ADORE algorithm automatically selects the sparsity level and performs similarly to the other methods 
that require careful tuning; ADORE is particularly competitive when the number of measurements is fairly 
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large, see Fig.|4](b). Therefore, the USS model selection criterion is quite effective in this practical example 
where the underlying signal is not strictly sparse. Figs.|4](c) and (d) show that DORE requires the smallest 
number of iterations and CPU time among the hard thresholding methods, and that it is also faster than the 
manually tuned GPSR method for N/m < 0.4. When N/m > 0.3, the CPU time of the automatic ADORE 
method (which employs multiple DORE iterations) is comparable to that of the tuned NIHT method. 

VIII. Concluding Remarks 

We proposed a probabilistic framework for sparse signal reconstruction in underdetermined linear models 
where the regression coefficient vector consists of a sparse deterministic component and a random Gaussian 
component. We developed three hard thresholding methods based on this framework: ECME, DORE, and 
ADORE. We showed that, under certain mild conditions, ECME converges to a local maximum of the 
concentrated marginal likelihood for the above probabilistic model. Our ECME convergence conditions are 
invariant to invertible linear transforms of either the rows or the columns of the sensing matrix. To develop 
our near-optimal recovery results for the ECME and DORE methods, we introduced new measures of the 
sensing matrix's reconstruction ability: sparse subspace quotient (SSQ) and minimum SSQ. Minimum SSQ 
is more flexible than the well-established restricted isometry property (RIP) and coherence measures: it 
is invariant to invertible linear transforms of the rows of the sensing matrix. When the minimum 2r-SSQ 
is sufficiently large, our ECME for sparsity level r perfectly recovers the true r- sparse signal from the 
noiseless measurements and estimates the best r-term approximation of an arbitrary non-sparse signal from 
noisy measurements within a bounded error. The DORE algorithm interleaves two overrelaxation steps 
with one ECME step and significantly accelerates the convergence of the ECME iteration. To automatically 
estimate the sparsity level from the data, we proposed the unconstrained sparsity selection (USS) criterion 
and utilized it to develop the automatic ADORE scheme that does not require prior knowledge of the signal 
sparsity level. 

Since only a single choice (17.11) is used to initialize DORE and ADORE, their PSNR curves in Section IVIIl 
are only lower bounds on the PSNRs achievable by these methods. The reconstruction performances of these 
methods can be improved by using multiple initial values, where the improvement is particularly significant 
for purely sparse signals: our preliminary results indicate that, in terms of reconstruction accuracy, DORE 
with multiple initial values outperforms AHT in the phantom example [see Fig. [3] (a)] and can slightly 
outperform the manually tuned GPSR in the Lena example [see Fig. |4] (b)]. Full details of the multiple 
initialization scheme and its reconstruction performance will be published elsewhere. Further research will 
also include: 

• analyzing the convergence speed of the DORE algorithm. 
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• looking for systematic means of generating sensing matrices that have large minimum SSQ, and 

• applying our probabilistic framework to develop sparse signal reconstruction methods for quantized 
measurements. 

Appendix 

Appendix A 
ECME Algorithm Derivation 

Consider the following hierarchical two-stage model: 

Pyl.{y\z) = Ar{y;Hz,C) (A.la) 
Pz\eiz\d) = M{z- s.a^I^) (A.lb) 

where z is the vector of missing data and C is a known noise covariance matrix. For C = Onxn, this 
model reduces to that in (|2.1al )- (|2.1b| ) in Section [III 

We will first derive an EM step for estimating s under the above general model and then set C = Onxn 
to reduce it to the EM step in Section |lll The complete-data likelihood function of the measurements y 
and the missing data z given 6 = (s, a^) G ©r follows from (|A.lal) and (|A.lbl) : 

From (|A.2I) . the conditional pdf of z given y and 6 is 



P. 



z I y.' 



e{z\y,e)=^(^z; s + {C + a' H H^)'^ {y - H s),a^ - {a^ {C + a' H H^Y^ H 

(A.3) 

see [|33l Theorem 11.1]. Assume that the parameter estimate 0*^^-' = (s'^p^ (o"^)^^^) is available; then, in 
Iteration p + 1, the E and M steps for estimating s simplify to 

^(p+i) = E , I [z 1 2/, e^^')] = s^*') + (a^) [C + (a^) H H^]-^ {y - H s^p^) (A.4a) 

and 

^(p+i) = argmin \\z^p+^^ - s\\l = Tr{z^''+^^). (A.4b) 

Setting C = Onxn in (|A.4al) and (IA.4bl) yields (I2.7al) and (I2.7bl) . which are not dependent on (cr^)^^). 
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Appendix B 
Proof of Theorem [H 

We first prove Lemma [51 which will be used in the proof of Theorem [T] 

Lemma 5: Assume that the sensing matrix H satisfies the URP condition, see also (|3.5bl) . For an index 
set A C {1,2, . . . ,m}, 



(a) if 



then 



(b) if 



then 



< dim(A) < (B.la) 



X^,^{Hl{H H^)-' Ha) > 0, (B.lb) 



< dim(yl) <m- N (B.2a) 



X^^^{hI{HH^)-^Ha) < 1. (B.2b) 



Proof: The conditions (I3.5bl) and (IB.lal) imply that all columns of Ha are linearly independent; 
therefore, H^(HH^)~^Ha is a full-rank positive definite matrix, and (IB.lbl) follows. 
We now assume (IB.2al) and show (|B.2bl) . Observe that 

Xmax{HAiHH^) ^ Ha) = Xma.x{iH H'^) ^ H aH"^) = Amax(-^Ar — (H H'^)^^ Ha^H^c) (B.3) 

where 

A' = {l,2,...,m}\A (B.4) 

defines the index index set complementary to A. Since dim(y4^) = m — dim(yl) > A^, (H H^)~^ iJ^cifJc 
is positive definite; therefore, 

\^^^{hI{HH^)-^Ha) = l-Xmin{{HH'')-'HAcH^c) <1 (B.5) 

and (iRlbl) follows. □ 
Now, we prove Theorem [TJ 
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Proof of Theorem\T} We now prove that our ECME iteration converges to its fixed point. If s^^^^^ = 
s^'^\ the convergence to a fixed point immediately follows. Therefore, without loss of generality, we assume 
s(p+i) ^ sip)_ Since E^s'^^^) in (13.11) converges to a limit, £(s^'''^) — £{s^P'^^^) converges to zero. Now, 



(p+i) I gip)-^ 



> 



,(p+i) 



[In 



) [-'dim(yl) 



HiiHH')-'HA] (s 



(p+i) 



,(p+i) 



— s 



iph 



||s(p+i) _S(P)||| 



.(p+i) 



.{P)||2 



— S 



(P)||2 



(B.6a) 
(B.6b) 
(B.6c) 

(B.6d) 
(B.6e) 



where A = supp U supp(s(P+i)). Here, fiKM follows from (l3^ . (lR6bl) follows by (l33al) and the 
fact that H(s(P) | s^p^) = 0, (lR6cl) is obtained by using the identities Hs'^* 



(p+i) 



,(p+i) 



,(P)||2 



and if fs^P+i) 



s^^), and (IB.6el) follows by using the Rayleigh-quotient property [|35 



Theorem 21.5.6]. Note that < dim(y4) < 2r < m — N, where the second inequality follows from (13. Sal) . 
Therefore, (IB.2al) holds and (IB.2bl) in Lemma [5] implies that the term 1 - A^ax ( H'^{HH'^y^HA ) in (IB.6el) 
is strictly positive. Since S{s^p'^) — £^(s*^p+^^) converges to zero, then ||s^p+^) — s^^^^H^^ converges to zero as 
well. Finally, the claim of monotonicity of convergence follows from the discussion in Section |ni] prior to 
Theorem [U This completes the proof. □ 

Appendix C 
Proofs of Lemma [H Lemma [2] and Theorem [2] 

Proof of LemmaU} The proof is by contradiction. Suppose that there exists an index i E {1,2, . . . , m} 
satisfying (I3.6al) . but not (I3.6bl) : without loss of generality, assume that the following partial derivative is 
positive: 

dfis*) A dfis) _u^fis' + ee,)-fis*) ^ , ^^ ^^ 



dsi 



dsi 



e-i-0 



S = S* e-i-U e 

where Cj is the zth column of J^. By the definition of the limit, there exists a 5 > such that, for all 

ee (0,5), 

/(s* + ee,,)-/(s*) dfis 



f{s*-ee,)-f{s* 



dsi 
df{s^ 



dsi 



< 



< 



1 df{s' 
1 df{s^ 



and, therefore. 



/(s* + ee,) >/(**) + |e 
/(5*-6e,,)</(s*)- 



df{s*_ 

dsi 
dfjs*: 
dsi 



dsi 



(C.2a) 
(C.2b) 

(C.3a) 
(C.3b) 
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For all e G (0,5), the vectors + eej and — eCj are r-sparse, /(s* + eej) is larger than /(s*), and 
/(s* — ee^) is smaller than /(«*), which contradicts the assumption that s* is an r-local maximum or 
minimum point. □ 
Before proving Lemma [21 we prove the following useful result that will be used in the proof of Lemma 

El 

Lemma 6: For any r-sparse vector s' G Sr, there exists a 5 > such that, for all s E Sr satisfying 
||s — s'We^ < 5, we have 

dim(supp(s) U supp(s')) < r. (C.4) 

Proof: The proof is by contradiction. First, define A = supp(s) and A' = supp(s'). Suppose that, for 
all (5 > 0, there exists a s E Sr satisfying 

||s - s'We^ < S (C.5a) 

and 

dim(y4 U A') > r. (C.5b) 

Since dim(y4) < r, 

dim (A' n A^) = dim (A U A') - dim(A) > r - r = (C.6) 

implying that the set A' fl A^ is not empty, see also the definition of the complementary index set in (IB. 4b . 
Choose 5 to be half the magnitude of the smallest nonzero element in s': 

6=lmm\s[\ (C.7) 

Now, 

s'Ika > II^A'nA^IL, > min|s^| > S (C.8) 

which contradicts (|C.5a| ). Therefore, for a positive number 6 in (IC.7I ). no r-sparse vector s can satisfy the 
conditions (IC.5al) and (|C.5b|) simultaneously. □ 
Lemma [6] shows that, in a sufficiently small neighborhood of an r-sparse vector s', the support sets of 
all other r-sparse vectors s significantly overlap with the support set of s'. In particular, if s' has exactly 
r nonzero elements (i.e. \\s'\\£g = r), then all other r-sparse vectors in its sufficiently small neighborhood 
must have the same support set as s'. If s' has less than r nonzero elements, an r-sparse vector s in this 
neighborhood can contain a few inconsistent elements that do not belong to the support set of s' as long 
as (|C.4I) is satisfied. We now prove Lemma [2l 
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Proof of Lemma |2]- We first consider the case of an r-local maximum of /(s) and assume that 
conditions (1) and (2) hold for a point s* G Sr- By condition (2), for the positive number 5i, the Hessian 
matrix is negative semidefinite around s* for all s E Sr satisfying ||s — s^H^^ < 5i. By Lemma [6l for any 
r-sparse vector s*, there exists a ^2 > such that, for all s E Sr satisfying ||s — s^W^^ < §2, we have 

dim(supp(s) U supp(s*)) < r. (C.9) 

Now, for 5 = mm{5i, 82}, consider any s E Sr satisfying \\s — s*\\i^ < 5, and expand f(s) around s* 
using the Taylor series with Lagrange's form of the remainder [|36l p. 243]: 



fis)-fis*) = 

< 



OS s=s* ^ OSOS^ 



ds 

i£supp(s)Usupp(s*) 



df{s) 



(s-s*) (C.lOa) 

s=s*+c (s — s*) 

(C.lOb) 
(C.lOc) 



(C.lOd) 



where c E (0, 1). Since the vector s* + c{s — s*) is r-sparse and satisfies ||s* + c (s — s*) — s*\\£^ < 5, the 
Hessian in (IC.lOal ) is negative- semidefinite and (IC.lObI) follows. Condition (1) of Lemma [2] and (IC.9I) imply 
that the partial derivatives in (13.71) are zero for all coordinates with indices i E supp(s) U supp(s*), and 
dC.lOdl) follows. Now, we have a 5 = mm{6i,62} > such that, for all s E Sr satisfying \\s — s*\\i,^ < S, 
/(•s) < /(•s*); therefore s* is an r-local maximum. 

If the Hessian matrix ggg^T is positive semidefinite around s*, then ggggT is negative semidefinite 
around s*. Therefore, s* is an r-local maximum of — /(s), and, by Definition [H s* is an r-local minimum 
of f{s). □ 

We are now ready to show Theorem [2l 

Proof of Theorem^- Since 0* = {s*, (cr^)*) is a fixed point of the ECME iteration, we have 

s* = argmin Q{s \ s*) = argmin \\s - [s* + H'^{HH'^)-\y - Hs*)]\\l (C.U) 

see dlTTbl) and (IX2bl) . 

We first show that the conditions of Lemma [2] hold for the function /(s) = £{s) in (13.11) and the r-sparse 
vector s* in (IC.l II) . The proof is by contradiction. Suppose that condition (1) of Lemma [2] is not satisfied, 
i.e. there exists an index i E {1,2, . . . , m} such that dim({i} Usupp(s*)) < r, but the corresponding partial 
derivative 

d£{s*) A d£{s 
dsi dsj 



(C.12) 

s=s* 
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is not zero; without loss of generality, assume that this partial derivative is positive: 

^ > 0, (C.13) 

dsi 

By the definitions of the partial derivative and limit, for the real number | ^^^f \ there exists a positive 
number 5 > such that, for all e G (0, 5), the vector = s* — e Si satisfies 

£{s^-ee,)-£{s*) d£{s^) ^ , d£{s*) ^^^^ 

2 dSi 



— e dsj 



and, therefore. 



£{s^ - e e,) < Sis*) - \ e (C.14b) 



Now, compute [see (I3.2cl) 1 

H{s* -eei\s*) -n{s''\s*) = {s* - eCi - s*f [Ira - H'^ {HH^Y^ H]{s* - eei - s*) (C.15a) 

|2 



< \\s* -ee,- s*\\l = e^ (C.15b) 



where (IC.lSbl) follows by observing that (HH^) ^ H is positive semidefinite. Therefore, we have [see 



Q{s*-eei\s*) = ^(s'^ - e e^) + - e | s*) (C.16a) 

< £{s'')-le^^i^ + nis*\s*) + e^ (C.16b) 

dsi 

\d£{s* 



where (|C.16bl) follows from (IC.14b|) and (IC.15I) . Note that the vector = s* — e ej is r-sparse. For any 

1 d£{s 

ds 



e e ( 0, min <j 5, i ^\ ' \ ] (C.17) 



we have Q{s^ \ s*) < Q{s* \ s*), which contradicts (|C.l II) . Hence, the condition (1) of Lemma [2l holds. 
The condition (2) of Lemma [2] holds because, for any s G T?.™, the Hessian of £{s) is 

p^ = 2H^{HH^)-^H (C.18) 
OS os^ 

which is clearly a positive semidefinite matrix. 

Since the conditions of Lemma [2] hold for the function /(s) = £{s) in (13.11) and fixed point s*, we 
apply Lemma [2] and conclude that s* is an r-local minimum point of £{s). Consequently, s* is an r-local 
maximum point of the concentrated marginal likelihood function (|2.6I) . which follows from the fact that 
(12. 6|) is a monotonically decreasing function of £{s) = Na'^{s), see also (|2.5I) . □ 
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Appendix D 

Proofs of Lemma [3l Lemma [H Theorem [3] and Theorem |4] 
We first show Lemma [3l 

Proof of Lemma\3} Equation (|4.2al) in part (a) follows by noting that H s = HaSa, and (|4.2bl) in part 



(a) follows by using (I4.2al) and the Rayleigh quotient property 135] Theorem 21.5.6], respectively: 



Pr.miniH) = mm pr{s,H) 

m X 1 



mm 



. sIhI{H H^)-^HaSa 
mm -. 



ylC{l,2,...,m},dim(A)=r L S^G7^'■\0rxl ll'Syl||^2 

min \^^{HI{HH'^)-'Ha) (D.l) 

Part (b) holds because the row spaces of H and G H coincide. 

The inequalities (14.41) in part (c) follow by applying the Rayleigh-quotient property and the fact that the 
projection matrix (HH^)^^ H only has eigenvalues and L When r > N, H^{HH^)^^ Ha is not 
a full-rank matrix for any index set A with dimension r; consequently, Pr.mm(-f^) = follows by using 
(I4.2bl) in part (a) of this lemma. When N = m, r-SSQ in (I4.1al) is equal to one for any < r < m, and, 
therefore, Pr,mmiH) = 1. 

In part (d), we first show that (|4.5I) implies (14.61) . When spark(if) > r, the matrix H^{H H^)^^ H a is 
positive definite for any Ad {1,2,..., m} with dim(y4) = r, and (14.61) follows by using (I4.2bl) in part (a) 
of this lemma. We now show the 'only if direction of part (d) by contradiction. Suppose that (14.61) holds 
but spark(/7) < r. By the definition of spark, there exists an index set A with dim(A) = r such that the 
columns of Ha are linearly dependent. Therefore, the minimum eigenvalue of the matrix H^{HH^)'^Ha 
is zero and (I4.2bl) implies that Pr,rain{H) = 0, which leads to contradiction. 

Finally, part (e) follows because the ri-sparse vector that minimizes p^is, H) is also r2-sparse and does 
not minimize pr^is^H) in general. □ 

Now, we prove Lemma |4l 

Proof of Lemma^- By part (d) of Lemma[3l the condition (14.101) holds if and only if spark(if) > 2r*, 
which is exactly the condition required in [8, Theorem 2] to develop the same claim about the uniqueness 
of the (Po) problem. This concludes the proof. □ 

The proof of Theorem [3] is shown as follows. 

Proof of Theorem \3\- Let s*^^^ be the estimate of s obtained in Iteration p of our ECME iteration. We 
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assume s*^^) 7^ without loss of generality; otherwise, the claim follows immediately. Now, 

1 



^""--^'-"'^ = ... „A H^iHH^)-'His^-s(P^'^)\\l (D.2a) 



< ^——Sis^P+'^) (D.2b) 

P2r,mini-n J 

^ ' Q(s(*^+^) I s(P)) - •H(s(P+^) I ] (D.2c) 



P2r,min (^) 

< ^— I (D.2d) 

< (D.2e) 

= ^——\\s'>-s'^P^-H^{HH^)-'H{s'>-s^P^)\\l (D.2f) 

P2r,minl-n j 

= ^——{s''-s^pY[I^-H^{HH^)-'H]{s^-s(p^) (D.2g) 

P2r,mini-n ) 

= — ^77n 01^' - ^^'^lll - Paris' - ^^^U) - ^^"^liy (D-2h) 

P2r,minl-n j 

< CiH) US'" - s^P% (D.2i) 

C(i^) = (D.3) 

P2r,minl-n J 

and (|D.2al) follows from the definition (I4.1al) and the fact that — s(p+^) is at most 2r-sparse since and 
s(p+i) are r-sparse; (|D.2bl) is due to the definitions (|3.1I) and (I4.1bl) . (|D.2cl) results from the identity (|3.2al) : 
(|D.2dD holds because 'H(s(p+^) | s^^)) is nonnegative [see (|3.2cl )1: (|D.2e| ) follows due to the M step of the 
ECME algorithm (IZTbl) : (1021 uses the definition (IX2bl) and the condition (l4.14aD : (|a2g])-(lE2ll) follow 
by expanding (|D.2fl) and using the definitions (I4.1al) . (I4.1bl) . and (ID.3I) . respectively. 



where 



We now apply the condition (|4.14b|) and conclude that C(-^) iii (ID. 31 ) is nonnegative and smaller than 
one: 

< C(H) = ^ ~ P^^''^'-^^'^ < 1. (D.4) 

Therefore, the sequence ||s* — s'^^-'Hf^ monotonically shrinks to zero and the claim follows. □ 
Finally, we prove Theorem IH 

Proof of Theorem^- Denote by s^p^ the sparse signal estimate in Iteration p of our ECME iteration. 
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Now, for p > 



^ \\H^{HH^y^H{s^P^')-st)h, ^^^^^ 

a/ P2r,mmiH) 



a/ P2r,min{H) 



(D.5b) 



a/ P2r,iain{H) 
a/ P2r,rmn{H) 

|< - - H^{HH^)-'[Hs^. + His'' -st) + n - + - + \\H^ {HH^)-'n\\e, 



< 



\/ P2r,raiii{H) 



(D.5e) 

a/ p2r,m.\n{H) 
a/ P2r,min(-n ) 

where (ID.Sal) follows from the definition (I4.1bl) and the fact that 5^^+^^ — is 2r-sparse; (ID.Sbl) follows 
by using (|4.15al ); in (|D.5c|) . we use the triangle inequality (||a + b\\i^ < \\a\\e,2 + 11^11^2)' definition (13.11 ). 
and the fact that the eigenvalues of (HH^)^^ H are and 1; (ID.5dl) follows along the same lines as 
(|D.2bl) - (ID.2el) with replaced by s^; (|D.5el) follows from (|3.2bl) and (I4.15ak (|D.5fl) holds due to the the 
triangle inequality and the fact that the eigenvalues of (HH^)^^ H are and 1; finally, (D.5g) follows 
from the same lines as (ID.2fl) - (ID.2il) with replaced by s*. 



From (D.5g), we can see by induction that, for p > 1 

\\s^'^ - sth, < m)f' - s^h, + 22SM^ [11^^ - + ll^'^ (M^)-^nll,,] 

V P2r,mm{-tl ) 

= [Cmf' \\s^'^ - K\U. + 2^=i_ ] ~ m^^!^, [\\s^ - KlU, + \\H^ (M^)-^nll,,] (D.6) 

where s^") is the initial signal estimate. Since the condition (|4.15cl) implies that CiH) in (|D.3I) is nonnegative 
and smaller than one [see (|D.4I) 1. we have 

lim [CiH)^/^ = (D.7) 

P /'+00 

the first term in (ID. 61 ) disappears (i.e. the effect of the initial signal estimate washes out), and the claim 
follows. □ 
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where 



Now, 



Appendix E 
Proof of Theorem [5] 

Proof of Theorem\5} When conditions (1) and (2) of Theorem [5] hold, we have spark(i7) = > 2 
and the condition of [|8l Theorem 2] is satisfied. Therefore, is the unique solution of the (Pq) problem, 
according to (HI Theorem 2]. We now consider the USS function under different sparsity level r. 

For r = r*, the ML estimate of 6 is ^mlI'^*) = (^ml('"*)) S'^ilI^*)) = (•s^, 0) and unique, since it leads 
to infinite likelihood function (I2.4al) and no other yields infinite likelihood, due to the fact that is the 
unique solution of the (Pq) problem. Furthermore, since (|6.2|) holds, we have N — r^ — 2 > and therefore 
USS(r) is infinite as well. Note that 

USS(r) = USS(r,a^,Jr)) (E.l) 
^ 4 ^ In (^) - 1 (iV - . - 2) m (^^^^^p^) . (E.2) 

lim ^^^^^^ = |(iV-r*-2)>0 (E.3) 
\o ln(l/cr2) 2 V ^ 

specifying the rate of growth to infinity of USS(r*, cr^) as cr^ approaches the ML estimate S'mlI^*) — 0- 
For r < r'^, y ^ H s for any r-sparse vector s; consequently, crli^^ir) > and USS(r) is finite. 
For r > r"^, the ML estimate of cr^ must be al^^{r) = 0, which leads to infinite likelihood. However, in 

this case. 

Therefore, if r > — 2, USS(r) is either finite or goes to negative infinity. For < r < N — 2, USS(r) 
is infinitely large, but the rate at which USS(r, cr^) grows to infinity as cr^ approaches the ML estimate 
^Mhi^) = is smaller than that specified by (|E.3I) . 

The claim follows by combining the above conclusions. □ 
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